ABSTRACT

We present a new system for simultaneous estimation of 
keys, chords, and bass notes from music audio. It makes 
use of a novel chromagram representation of audio that 
takes perception of loudness into account. Furthermore, 
it is fully based on machine learning (instead of expert 
knowledge), such that it is potentially applicable to a wider 
range of genres as long as training data is available. As 
compared to other models, the proposed system is fast and 
memory efficient, while achieving state-of-the-art perfor- 
mance. 

1. INTRODUCTION 

Chords, along with the key and bassline, are essential mid-level
features of western tonal music, and their evolution is fundamental
to musical analysis. In recent years, audio chord transcription and
tonal key recognition have been very active fields
[2, 4, 9-11, 13, 15, 18], and the increasing popularity of Music
Information Retrieval (MIR) with applications using mid-level tonal
features has established chord and key recognition as useful and
challenging tasks (see also e.g. the MIREX competitions).

Since chords and keys are musical attributes closely related to each
other in western tonal music [8], the idea of learning both
progressions of a song simultaneously comes naturally. In general,
such key/chord recognition systems are implemented using an HMM-like
approach, based on a set of features extracted from the audio signal.
A well-established audio feature for harmonic analysis is the
chromagram [6]. It is a 12-dimensional representation of the harmonic
content of the audio signal segmented into so-called frames, and it
reflects the distribution of energy along pitch classes. In this paper
the chromagram for the audio signal $x$ is denoted as
$X \in \mathbb{R}^{12 \times T}$, with $T$ indicating the number of
frames.

An HMM [17] commonly regards chromagrams and annotations as observed
and hidden variables respectively. Let $k \in A_k^{1 \times T}$ and
$c \in A_c^{1 \times T}$ be the key and the chord annotations of $x$,
where $A_k$ and $A_c$ represent the alphabets of keys and chords
respectively. HMMs can then be used to formalize a probability
distribution $P(k, c, X \mid \Theta)$ jointly for the chromagram
feature vectors $X$ and the annotations, with $\Theta$ representing
the parameters of this distribution. Given an HMM with optimal
parameters $\Theta^*$, the key/chord recognition task is equivalent to
finding $\{k^*, c^*\}$ that maximize the joint probability

$$\{k^*, c^*\} = \arg\max_{k, c} P(k, c, X \mid \Theta^*).$$

Figure 1. The learning procedure (via Approach B) of the
proposed Harmony Progression (HP) system. The blocks 
in red show the novelties of the system. 

Some existing key/chord recognition systems are based on Machine
Learning (ML), where parameters are learned from a fully annotated
training data set of features, keys and chords:
$\{X, K, C\} = \{X^n \in \mathbb{R}^{12 \times T_n}, k^n \in A_k^{1 \times T_n}, c^n \in A_c^{1 \times T_n}\}_{n=1}^{N}$
(Approach B in Figure 1) [9]. However, most approaches are based at
least partially on expert knowledge, where parameters are set on the
basis of the music-theoretic knowledge of the developers (Approach A
in Figure 1) [2, 10, 11, 13, 15, 18]. For example, the key and chord
transition parameters are set by hand, usually informed by
perceptual key-to-key and chord-to-key relationships [8].
This contrasts with a clear tendency in Artificial Intelli- 
gence research to move away from systems based on ex- 
pert knowledge to ML systems, e.g. in speech recognition, 
Figure 2. The HMM topology of the HP system. The
probabilities in red are parameters of the system, which 
are learnt via maximum likelihood estimation (MLE). 



rately represent human perception of the audio's spectral 
content. Although an alternative chromagram has been claimed to model
human auditory sensitivity [16], that framework is rudimentary: it
still uses the spectrum as pitch energy and merely applies an
arctangent function to mimic pitch perception, without a rigorous
justification. In fact, the empirical study in [5] showed that
loudness is approximately linearly proportional to the so-called sound
power level, defined as $\log_{10}$ of the power spectrum. Therefore,
we developed a novel loudness based chromagram, which uses the
$\log_{10}$ scale of the power spectrum. Mathematically, a sound power
level (SPL) matrix is of the form

$$\mathcal{L}_{s,t} = 10 \log_{10} \frac{\|X_{s,t}\|^2}{p_{\mathrm{ref}}}, \quad s = 1, \dots, S, \; t = 1, \dots, T,$$

where $p_{\mathrm{ref}}$ indicates the fundamental reference power and
$\|X_{s,t}\|^2$ is the power spectrum at frequency $f_s$ in frame $t$.



machine translation, computer vision, etc. We start from the premise
that the key/chord recognition task is no different and propose the
Harmony Progression (HP) system for recognizing keys/chords from audio
relying purely on ML techniques. The HP system is trained as
illustrated in Figure 1 (Approach B) and the detailed HMM topology is
depicted in Figure 2. Generally speaking, it is a simultaneous
key/chord predictor that also identifies bass notes, going beyond most
of the existing key/chord recognition systems [2, 9, 10, 13, 15, 18].
To our knowledge, the only system sharing a similar HMM topology is
the expert knowledge based system proposed in [11], the musical
probabilistic (MP) model.

Compared with the MP system, the proposed HP system offers two further
advances. Firstly, it utilizes a novel chromagram extraction method,
supported by a well-founded physical interpretation. Secondly, our
system is shown to be fast and memory-efficient in a case study. It
also achieves an excellent tradeoff between performance and processing
time in our experiments.

2. SYSTEM DESCRIPTION 

2.1 Loudness based chromagram 

Let $x = [x_1, \dots, x_T]$ be an audio signal with $x_t$ indicating
the sample data of the $t$-th frame; then the chromagram extraction
assigns attributes (e.g. power or amplitude)
$X \in \mathbb{R}^{S \times T}$ to a set of frequencies
$F = \{f_1, \dots, f_S\}$ such that $X$ reflects the energy
distribution of the audio along these frequencies. In order to capture
musically relevant information, the frequencies are selected from the
equal-tempered scale, which may be tuned [7] and vary between songs.
Popular implementations of chromagram extraction are fixed bandwidth
Fourier [6] and constant Q [1] transforms.

The above two chromagram systems represent the salience 
of pitch classes in terms of a power or amplitude spec- 
trum. We note however that perception of loudness is not 
linearly proportional to the power or amplitude spectrum, 
and hence such chromagram representations do not accu- 
Here $X_{s,t}$ is obtained from a constant Q transform with a
frequency dependent bandwidth $L_s = Q \cdot SR / f_s$ (see footnote 1)
and the Hamming window $w_n$ [1].

Furthermore, low/high frequencies require higher sound power levels
for the same perceived loudness as mid-frequencies [5]. To compensate
for this, we propose to use A-weighting [20] to transform the SPL
matrix into a representation of the perceived loudness of each of the
pitches:

$$\mathcal{L}'_{s,t} = \mathcal{L}_{s,t} + A(f_s), \quad s = 1, \dots, S, \; t = 1, \dots, T,$$

where

$$R_A(f_s) = \frac{12200^2 \, f_s^4}{(f_s^2 + 20.6^2) \sqrt{(f_s^2 + 107.7^2)(f_s^2 + 737.9^2)} \, (f_s^2 + 12200^2)},$$

$$A(f_s) = 2.0 + 20 \log_{10}(R_A(f_s)).$$
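
As a rough illustration of the A-weighting step, the two formulas above can be evaluated directly. The sketch below is not the authors' implementation: the function names `a_weighting_db` and `perceived_loudness` are hypothetical, and the input sound power level is assumed to be expressed in dB.

```python
import math


def a_weighting_db(freq_hz: float) -> float:
    """A-weighting offset A(f) in dB, following the R_A and A(f) formulas above."""
    f2 = freq_hz ** 2
    numerator = 12200.0 ** 2 * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12200.0 ** 2)
    )
    return 2.0 + 20.0 * math.log10(numerator / denominator)


def perceived_loudness(spl_db: float, freq_hz: float) -> float:
    """L'_{s,t} = L_{s,t} + A(f_s) for a single frequency bin (hypothetical helper)."""
    return spl_db + a_weighting_db(freq_hz)
```

The offset is close to 0 dB around 1 kHz and strongly negative at bass frequencies, so low pitches need a higher sound power level to reach the same weighted loudness.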

It is known that loudnesses are additive if they are not close in
frequency [19]. This allows us to sum up the loudness of sounds on the
same pitch class, yielding:

$$X'_{p,t} = \sum_{s=1}^{S} \delta(M(f_s), p) \, \mathcal{L}'_{s,t}, \quad p = 1, \dots, 12, \; t = 1, \dots, T.$$

Here $\delta$ denotes an indicator function and

$$M(f_s) = \left( \left\lfloor 12 \log_2 \frac{f_s}{f_A} + 0.5 \right\rfloor + 69 \right) \bmod 12,$$

with $f_A$ denoting the reference frequency of the pitch A4 (440 Hz in
standard pitch). Finally, our loudness-based chromagram, denoted
$X_{p,t}$, is obtained by normalizing $X'_{p,t}$ using:

$$X_{p,t} = \frac{X'_{p,t} - \min_{p'} X'_{p',t}}{\max_{p'} X'_{p',t} - \min_{p'} X'_{p',t}}.$$

Note that this normalization is invariant to the reference power and
hence a specific $p_{\mathrm{ref}}$ is not required.
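
The pitch-class summation and the min-max normalization can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `pitch_class` and `loudness_chroma_frame` are hypothetical, pitch classes are indexed 0 to 11 (0 = C) rather than 1 to 12, and a flat frame is mapped to all zeros to avoid a division by zero.

```python
import math


def pitch_class(freq_hz: float, ref_a4_hz: float = 440.0) -> int:
    """M(f): nearest equal-tempered semitone folded to a pitch class (0 = C, 9 = A)."""
    return (math.floor(12.0 * math.log2(freq_hz / ref_a4_hz) + 0.5) + 69) % 12


def loudness_chroma_frame(freqs_hz, loudness):
    """Sum per-frequency loudness over pitch classes, then min-max normalise one frame."""
    raw = [0.0] * 12
    for freq, value in zip(freqs_hz, loudness):
        raw[pitch_class(freq)] += value
    lowest, highest = min(raw), max(raw)
    if highest == lowest:
        # Assumption: a flat frame carries no harmonic information.
        return [0.0] * 12
    return [(value - lowest) / (highest - lowest) for value in raw]
```

A different reference power adds the same constant to every $\mathcal{L}'_{s,t}$. When each pitch class collects the same number of frequency bins (as with ranges spanning whole octaves), that constant shifts all twelve sums equally and cancels in the normalization, which is one way to see why $p_{\mathrm{ref}}$ need not be fixed.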



1 $Q$ is a constant resolution factor which can be tuned by
cross-validation, and $SR$ is the sampling rate of the audio signal.



2.2 HP HMM topology 



2.3 Search space reduction 



The HP HMM topology consists of three hidden and two observed
variables. The hidden variables correspond to the key $K$, the chord
$C$ and the bass annotations
$B = \{b^n \in A_b^{1 \times T_n}\}_{n=1}^{N}$. Under this
representation, a chord is decomposed into two aspects: chord label
and bass note. Take the chord A:maj/3 for example: the chord state is
$c$ = A:maj and the bass state is $b$ = C#. Accordingly, the observed
chromagrams are decomposed into two parts: the treble chromagram
$X^c$, which is emitted by the chord sequence $c$, and the bass
chromagram $X^b$, which is emitted by the bass sequence $b$. The reason
for applying this decomposition is that different chords can have the
same bass note, resulting in similar chromagrams in the low frequency
domain.

Under this framework, the set $\Theta$ of an HP HMM has the following
parameters

$$\Theta = \{ p_i(k_1), p_i(c_1), p_i(b_1), p_t(k_t \mid k_{t-1}), p_t(c_t \mid c_{t-1}, k_t), p_t(b_t \mid c_t), p_t(b_t \mid b_{t-1}), p_e(X^c_t \mid c_t), p_e(X^b_t \mid b_t) \},$$

where $p_i$, $p_t$ and $p_e$ denote the initial, transition and
emission probabilities respectively. The joint probability of the
feature vectors $\{X^c, X^b\}$ and the corresponding annotation
sequences $\{k, c, b\}$ of a song is then given by the formula (see
footnote 2)

$$p(X^c, X^b, k, c, b \mid \Theta) = p_i(k_1)\, p_i(c_1)\, p_i(b_1) \prod_{t=2}^{T} p_t(k_t \mid k_{t-1})\, p_t(c_t \mid c_{t-1}, k_t)\, p_t(b_t \mid c_t)\, p_t(b_t \mid b_{t-1}) \prod_{t=1}^{T} p_e(X^c_t \mid c_t)\, p_e(X^b_t \mid b_t).$$

The initial probabilities $p_i(\cdot)$ can be learnt via maximum
likelihood estimation (MLE). For example,
$p_i(c) = \frac{\#(c_1 = c)}{N}$, $\forall c \in A_c$, where $\#$
indicates the number of occurrences in the training set.

For the transitions, $p_t(c \mid \bar{c}, k)$ represents the
probability of a chord change under a certain key. Since the chord
transition is strongly influenced by the underlying key [13], this
probability is modelled as key dependent. Under the assumption that
relative chord transitions are key independent, we transposed all
sequences to a common key $k$ and learned $p_t(c \mid \bar{c}, k)$
from the transposed sequences. This allowed us to get 12 times as much
information from the data source, and the MLE solution is

$$p_t(c \mid \bar{c}, k) = \frac{\#(c_t = c \;\&\; c_{t-1} = \bar{c} \;\&\; k_t = k)}{\sum_{c'} \#(c_t = c' \;\&\; c_{t-1} = \bar{c} \;\&\; k_t = k)}, \quad \forall c, \bar{c}, k.$$
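
The counting behind this MLE solution can be sketched as below. This is one possible reading of "transposed all sequences to a common key", not the authors' code. It assumes chords are encoded as `(root_pitch_class, quality)` with root `None` for no-chord, keys as `(tonic_pitch_class, mode)`, and that every frame is transposed so its key has tonic 0 before counting. The names `transpose` and `learn_chord_transitions` are hypothetical.

```python
from collections import Counter, defaultdict


def transpose(chord, tonic):
    """Shift a (root, quality) chord so that the current key has tonic 0."""
    root, quality = chord
    if root is None:  # no-chord has no root to shift
        return chord
    return ((root - tonic) % 12, quality)


def learn_chord_transitions(songs):
    """MLE of p_t(c | c_prev, k) by counting transposed chord pairs.

    songs: list of (keys, chords), two per-frame sequences of equal length.
    Returns probs[(mode, prev_chord)][chord], all relative to tonic 0.
    """
    counts = defaultdict(Counter)
    for keys, chords in songs:
        for t in range(1, len(chords)):
            tonic, mode = keys[t]
            previous = transpose(chords[t - 1], tonic)
            current = transpose(chords[t], tonic)
            counts[(mode, previous)][current] += 1
    probs = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        probs[context] = {chord: n / total for chord, n in counter.items()}
    return probs
```

Because a song in D major and a song in C major now contribute to the same table, every observed transition informs all twelve transpositions of the key, which is where the twelvefold gain in usable data comes from.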



Similarly, $p_t(k \mid \bar{k})$ is applied to model key changes
during a song. $p_t(b \mid c)$ models the probability of a bass note
under a chord label so as to capture chord inversions. A transition
link $p_t(b \mid \bar{b})$ is also added, with the purpose of modelling
the continuity of bass notes and capturing ascending and descending
bassline progressions. These parameters are learnt via MLE.



Given the optimal parameters $\Theta^*$ via MLE, the decoding task can
be formalized as the computation of the key, chord and bass sequences
$\{k^*, c^*, b^*\}$ that maximize the joint probability

$$\{k^*, c^*, b^*\} = \arg\max_{k, c, b} P(X^c, X^b, k, c, b \mid \Theta^*).$$

This task can be solved using the Viterbi algorithm [17], whose
computational complexity is $O(|A_k|^2 |A_c|^2 |A_b|^2 T)$. This is a
huge search space, especially when one would like to use a large chord
vocabulary [11]. In order to reduce the decoding time, we propose
three constraints on the search space:

2.3.1 Key transition constraint 

Music theory dictates that not all key changes are equally likely. If
a song does change key, the modulation is most likely to move to a
related key [8]. Thus, we propose to rule out a priori the key
transitions that are seen least often in the training set. Formally,
this can be done by constraining the key transition probability as



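The transition parameters are learnt via MLE, e.g.
$p_t(k \mid \bar{k}) = \frac{\#(k_t = k \;\&\; k_{t-1} = \bar{k})}{\sum_{k'} \#(k_t = k' \;\&\; k_{t-1} = \bar{k})}$,
$\forall k, \bar{k} \in A_k$.

Finally, emission probabilities $p_e(X^c_t \mid c_t)$ and
$p_e(X^b_t \mid b_t)$ are modelled as 12-dimensional Gaussians, of
which the mean vectors and covariance matrices are learnt via MLE as
well.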



2 Note that we use
$p_t(b_t \mid b_{t-1}, c_t) = p_t(b_t \mid c_t)\, p_t(b_t \mid b_{t-1})$,
which from a purely probabilistic perspective is not correct. However,
this simplification reduces computational and statistical cost and
results in better performance in practice.



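$$p'_t(k \mid \bar{k}) = \begin{cases} p_t(k \mid \bar{k}) & \text{if } \#(k_t = k \;\&\; k_{t-1} = \bar{k}) > \gamma, \\ 0 & \text{otherwise,} \end{cases}$$

where $\gamma$ is a positive integer indicating the threshold.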
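
The thresholding rule can be written in a few lines. The sketch below assumes the learnt probabilities and the raw training counts are both stored in dictionaries keyed by `(previous_key, key)`; this layout and the name `constrain_key_transitions` are illustrative assumptions, and no renormalization is applied because the formula above does not mention one.

```python
def constrain_key_transitions(probs, counts, gamma):
    """Keep p_t(k | k_prev) only if the transition was seen more than gamma times."""
    return {
        pair: (p if counts.get(pair, 0) > gamma else 0.0)
        for pair, p in probs.items()
    }
```

During decoding, any key pair mapped to zero can be skipped entirely, which is what shrinks the search space.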

2.3.2 Chord to bass transition constraint 

Similar to the key transition constraint, we can also constrain the
chord to bass transitions. A constraint is imposed on $p_t(b \mid c)$
such that the bass note can only be one of $r$ ($r < 12$) candidates
for a given chord. The occurrence counts of each chord-to-bass
transition are ranked and only the $r$ most common are permissible.
Mathematically:

$$p'_t(b \mid c) = \begin{cases} p_t(b \mid c) & \text{if } b \text{ is one of the top } r \text{ bass notes for } c, \\ 0 & \text{otherwise.} \end{cases}$$

When $r = 3$, the constraint is equivalent to using the root position,
first and second inversions of a chord.
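
A minimal sketch of the top-$r$ rule follows. It assumes the learnt probabilities are stored as a nested dictionary `probs[chord][bass]` and ranks by probability, which orders the bass notes of a given chord the same way as ranking by count. The name `constrain_bass_given_chord` is hypothetical.

```python
def constrain_bass_given_chord(probs, r):
    """Zero p_t(b | c) for every bass note outside the r most probable ones per chord."""
    constrained = {}
    for chord, bass_probs in probs.items():
        ranked = sorted(bass_probs, key=bass_probs.get, reverse=True)
        allowed = set(ranked[:r])
        constrained[chord] = {
            bass: (p if bass in allowed else 0.0) for bass, p in bass_probs.items()
        }
    return constrained
```

For a chord such as A:maj, $r = 3$ would typically retain A, C# and E, matching the root position and the two inversions mentioned above.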

2.3.3 Chord alphabet constraint (CAC) 

It is unlikely that all chords will be used in a single song.
Therefore, if it is possible to find out which chords are used in a
song, we will be able to constrain the chord alphabet without loss of
performance. One heuristic method is to utilize two-stage predictions.
In particular, using a simple HMM with only chords as the hidden
chain, we first apply a max-Gamma decoder [17] to a song and obtain
the most probable chords $A'_c$. Then, we force the HP HMM chord
transition probability to be zero for chords that are absent from this
output:

$$p'_t(c \mid \bar{c}, k) = \begin{cases} p_t(c \mid \bar{c}, k) & \text{if } c, \bar{c} \in A'_c, \\ 0 & \text{otherwise.} \end{cases}$$
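
The second stage of the chord alphabet constraint amounts to masking the transition table with the set of chords returned by the first pass. The sketch below assumes a table layout `probs[(previous_chord, key)][chord]` and takes the first-pass output as a plain list of labels; both choices, and the name `constrain_chord_alphabet`, are illustrative assumptions. The first-pass decoder itself is not shown.

```python
def constrain_chord_alphabet(probs, first_pass_chords):
    """Zero every chord transition that touches a chord absent from the first pass."""
    allowed = set(first_pass_chords)
    constrained = {}
    for (previous, key), row in probs.items():
        constrained[(previous, key)] = {
            chord: (p if previous in allowed and chord in allowed else 0.0)
            for chord, p in row.items()
        }
    return constrained
```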



3. EXPERIMENTS 

3.1 Audio dataset and ground truth annotations 

The audio dataset used is the one used in the MIREX Chord Detection
task 2010 (see footnote 3), which contains 217 songs. The ground truth
key and chord annotations were obtained from http://isophonics.net,
while the bass notes are extracted directly from the ground truth
chord annotations.

3 http://www.music-ir.org/mirex/wiki/2010:Audio_Chord_Estii

3.2 Preprocessing and chromagram feature extraction 

As shown in Figure 1, we first converted our signals to mono at
11025 Hz, and separated the harmonic and percussive elements with the
Harmonic/Percussive Signal Separation algorithm (HPSS) [14]. After
tuning [7], we computed loudness based chromagrams for each song. The
frequency range of the bass chromagram was A1 to G#3
(55 Hz - 207.65 Hz), and that of the treble chromagram was A3 to G#6
(220 Hz - 1661.2 Hz). Finally, we estimated beat positions using the
beat tracker presented in [3] and took the median chromagram feature
between consecutive beats. We also beat-synchronized our
key/chord/bass annotations by taking the most prevalent labels between
beats. The median feature vector with the corresponding
beat-synchronized annotations is then regarded as one frame.
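
The beat synchronization described above can be sketched as follows, assuming the chromagram is a list of 12-dimensional frames with one timestamp and one label per frame, and that beat times are already available from a beat tracker. The function name `beat_synchronise` and the half-open interval convention `[start, end)` are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def beat_synchronise(chroma_frames, frame_labels, frame_times, beat_times):
    """Return one (median chroma vector, most prevalent label) pair per beat interval."""
    segments = []
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        inside = [i for i, t in enumerate(frame_times) if start <= t < end]
        if not inside:
            continue  # no analysis frame falls between these two beats
        vector = [median(chroma_frames[i][p] for i in inside) for p in range(12)]
        label = Counter(frame_labels[i] for i in inside).most_common(1)[0][0]
        segments.append((vector, label))
    return segments
```

Taking the median rather than the mean keeps a single transient frame within a beat from distorting the feature vector.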

3.3 Major/minor chord prediction 

In this experiment, we used a full key alphabet (12 major and 12 minor
keys), but restricted ourselves to a chord alphabet of 25 chords
(12 major, 12 minor and no-chord). There were 13 bass states
corresponding to the 12 pitch classes as well as a 'no bass' state. In
accordance with the MIREX train-test setup, we randomly selected 2/3
of the songs from each album to form the training set, while the
remaining 1/3 were used for testing. The same chord evaluation metrics
used in the MIREX 2010 competition (denoted by 'OR' and 'WAOR', see
footnote 4) were applied to report chord prediction performance.
Meanwhile, to evaluate the performance of key and bass predictions,
the accuracy of predominant key prediction (denoted by 'key-P', see
footnote 5) and the frame-based bass accuracy (denoted by 'F-acc')
were also reported. The experiment was repeated 102 times to assess
variance.

To compare chord and bass predictions, two HMM-Viterbi systems
(denoted as HMM-C and HMM-B) are taken as baselines. For HMM-C, the
observed variable is a concatenation of the treble and bass
chromagrams and the hidden states are the 25 chords; in HMM-B only the
bass chromagram is used as the observation and the hidden states are
the 13 bass notes. Finally, to compare key predictions, the
performance of a key-specific HMM [9] (denoted as K-HMM) is also
reported.

Table 1 shows the results, together with the significance of the
improvement of the HP system over the other systems, assessed using a
paired t-test. The first row shows the results of the HMM-Viterbi
chord prediction system using the loudness based chromagram. This
simple system already outperforms the best train-test system presented
in MIREX 2010, whose results are 74.76% (OR) and 73.37% (WAOR) (see
footnote 6),



System    Chord                    Key          Bass
          OR [%]     WAOR [%]      key-P [%]    F-acc [%]
HMM-C     77.82**    77.22**       N/A          N/A
HMM-B     N/A        N/A           N/A          73.62**
K-HMM     78.22**    77.62**       76.88*       N/A
HP        79.37      78.82         77.36        83.81
HP-P      81.52      81.37         83.33        85.15



4 'OR' refers to chord overlap ratio in MIREX 2010 evaluation and 
'WAOR' refers to chord weighted average overlap ratio. 

5 Like in [9, 13], we regard the first key in the ground truth key se- 
quence as the predominant key of this song, while the predicted predom- 
inant key will be the most prevalent key in the key prediction. 



Table 1. Performances for the baseline, key-specific HMM
and HP systems on the major/minor chord prediction task.
Bold numbers indicate the best results. The improvement
of HP is significant at a level $< 10^{-40}$ and $< 10^{-1}$ over
the performances marked by ** and * respectively. The last
line also shows the training set performance of HP.

verifying the effectiveness of the novel loudness based chromagram
extraction. Table 1 also indicates that increasing the complexity of
models helps harmonic estimation, and that the HP system achieves the
best performance on all evaluations.

To compare with the MIREX pre-trained systems, we trained and then
tested our system on the whole dataset (denoted by HP-P). This
provides an upper bound on the performance the HP system can achieve,
although it is of course subject to overfitting the data. Compared
with the best pre-trained system (namely MD1) presented in MIREX 2010,
the results of which are 80.22% (OR) and 79.45% (WAOR), our
pre-trained system achieves an improvement of more than 1 percentage
point on both metrics. Unfortunately we are unable to do a paired
t-test on the results since we do not have their detailed predictions
on each song.

Finally we investigated the proposed search space reduction
techniques. Figure 3(a) shows that using a reasonable cutoff $\gamma$
can reduce the decoding time dramatically while retaining high
performance. The same trend is also observed when applying a
reasonable $r$ to the chord to bass transition constraint (red dotted
curves in Figure 3(b)). Furthermore, using a chord alphabet constraint
(solid curves in Figure 3(b)) did not decrease the performance (in
fact it gave a slight improvement), while the decoding time was also
reduced. To summarize, by applying all these techniques, we are able
to speed up decoding without decreasing the performance. Thanks to
this, we can also apply HP to more complex chord representations in
the next subsection.

3.4 Full chord prediction 

Here we applied the proposed system to a chord recognition task using
the chord dictionary used in [11], with 12 root notes and 11 chord
types (see footnote 7), resulting in 121 unique chords. To the best of
our knowledge, the current systems that can handle this vocabulary are
the musical probabilistic model (denoted by MP) [11] and Chordino [12].

We first compared the processing time and memory consumption on two
songs between our system and the state-


7 maj, min, maj/3, maj/5, maj6, maj7, min7, 7, dim, aug and 'N'. 



6 The results are quoted from the MIREX 2010 result pages at http://nema.lis.illinois.edu/.
Figure 3. The performances and decoding times of HP
using different search space reductions. The experiments
in (a) were done without chord alphabet constraint and $r$ is
fixed at 4. In (b), 'CAC' refers to chord alphabet constraint
and the experiments were carried out with $\gamma$ fixed at 10.



of-the-art MP model (Table 2). Encouragingly, HP consumes less memory
and is faster, even using a slower CPU.





            Processing time (s)    Peak memory (G)
            HP       MP            HP       MP
Song 1      58       131           0.48     6
Song 2      171      345           1.20     15


Table 2. The comparison of processing time and memory
consumption between the HP and MP systems. Song 1 is
"Ticket to Ride" (190s) and Song 2 is "I Want You (She's
So Heavy)" (467s). The MP results were obtained on a
computer running CentOS 5.3 with 8 Xeon X5577 cores at
2.93GHz, 24G RAM. HP was run on a CentOS 5.6 computer
with Intel(R) X5650 cores at 2.67GHz, 24G RAM.

Since MP is not publicly available, we instead compared HP to
Chordino [12] (denoted by CH), which uses the same NNLS chroma
features as MP but a simpler model. Comparing with CH also seems more
appropriate because its computation/memory cost is more reasonable and
in line with HP. For HP, the parameters $r$ and $\gamma$ are fixed at
3 and 10. All other parameters are trained using the whole dataset
(denoted by HP-P). To assess generalization ability, we also computed
the leave-one-out error for HP (denoted by HP-L). We used 3
performance metrics: chord precision (CP), which scores 1 if the
ground truth and predicted chords are identical and 0 otherwise (e.g.
the score between A:maj/3 and A:maj is 0); note-based chord precision
(NCP), which scores 1 if all notes are identical between ground truth
and predicted chords and 0 otherwise (e.g. the score between A:maj/3
and A:maj is 1 but that between A:maj and A:maj7 is 0); and the MIREX
'WAOR' evaluation. All evaluations are performed on sequences sampled
at 1 ms intervals, as used in the MIREX 2010 competition. Tests were
done on a Mac with an Intel Duo Core 2.4 GHz CPU and 4 GB RAM.
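
The difference between CP and NCP can be made concrete with a small sketch. The interval table below uses the standard semitone content of the 11 chord types listed in footnote 7; treating the two inversions as having the same notes as the root-position major triad follows the A:maj/3 example above. The names `chord_notes`, `chord_precision` and `note_chord_precision` are hypothetical, and the per-sample averaging used in the actual evaluation is not shown.

```python
PITCH_CLASS = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}

# Semitone offsets from the root for each chord type in the vocabulary.
INTERVALS = {
    "maj": (0, 4, 7), "min": (0, 3, 7), "maj/3": (0, 4, 7), "maj/5": (0, 4, 7),
    "maj6": (0, 4, 7, 9), "maj7": (0, 4, 7, 11), "min7": (0, 3, 7, 10),
    "7": (0, 4, 7, 10), "dim": (0, 3, 6), "aug": (0, 4, 8),
}


def chord_notes(label):
    """Set of pitch classes sounded by a chord label such as 'A:maj/3' ('N' is empty)."""
    if label == "N":
        return frozenset()
    root, quality = label.split(":", 1)
    return frozenset((PITCH_CLASS[root] + step) % 12 for step in INTERVALS[quality])


def chord_precision(truth, predicted):
    """CP: 1 only when the two labels are identical."""
    return 1 if truth == predicted else 0


def note_chord_precision(truth, predicted):
    """NCP: 1 when both labels contain exactly the same pitch classes."""
    return 1 if chord_notes(truth) == chord_notes(predicted) else 0
```

With these definitions, a predicted A:maj against a ground truth of A:maj/3 is penalized by CP for missing the inversion but not by NCP.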

Table 3 shows a very large improvement over the baseline CH, even on
the MIREX-style evaluation. Moreover, the full chord HP-P system
achieves a further improvement on WAOR over the HP-P in the
major/minor chord prediction task (82.98% versus 81.37%), again
indicating that increasing the complexity of models helps harmonic
estimation. Meanwhile, we found that the cause of the low performance
of CH is that it predicted many complex chords (notably 7ths). This is
a good strategy for the MIREX evaluation, which only measures the
overlap recall between notes in predicted and ground truth chords.
However, it adversely affects the performance measured using CP and
NCP. Comparing the processing time, our system is slower (12756 s of
feature extraction plus 818 s of decoding, versus 9511 s in total for
CH) due to the separate calculation of bass and treble chromagrams.
However, the decoding process is very fast and thus the system is
still easy to apply to real world harmonic analysis tasks.



System    CP [%]    NCP [%]    WAOR [%]
CH        50.31     52.35      76.94
HP-L      63.63     65.24      81.05
HP-P      70.26     71.96      82.98

System    Processing time (s)
          Feature extraction    Decoding
CH        9511 (whole processing time)
HP        12756                 818



Table 3. Performance (top) and processing time (bottom) 
for the baseline and HP systems on the full chord predic- 
tion task. Bold numbers refer to the best results. Note that 
for the CH system only the whole processing time is avail- 
able. 



4. CONCLUSIONS AND FUTURE WORK 

In this paper we propose a novel system for simultaneous key, chord
and bass recognition, the HP system, that relies purely on ML
techniques. The experimental results verify that the HP system
achieves state-of-the-art performance on chord recognition, and that
it can be sped up significantly using the search space reduction
techniques without severely decreasing the performance.

Firstly, HP uses a novel chromagram extraction method, which is
inspired by loudness perception studies and achieves better
recognition performance. Secondly, HP relies purely on ML techniques,
which provides more flexibility in its applications and promises
further improvements if more data becomes available. Finally, HP
achieves an excellent tradeoff between performance and processing
time, making it applicable to real world harmonic analysis tasks.

For future work, we aim to improve the processing time 
for chromagram extraction. This can be done by mov- 
ing to faster programming languages such as C and C++. 
We will also move towards discriminative approaches us- 
ing the same HMM topology, which might lead to a more 
robust and powerful harmonic analysis tool. 